Fix OpenMP thread allocation in Linux #5551
Conversation
@guolinke @shiyu1994 Just checking in for updates. We are awaiting this fix to unblock a release. |
Thanks for the fix. Just left two quick questions. Could you please help address them? I'll get this merged as soon as possible.
@svotaw will force set |
Not sure what you are suggesting. Are you asking whether that would fix it in my particular case, or for everyone? I'm not sure we should force a fixed number of threads (and one as low as 8), and according to the docs you can override this with the local annotation in some cases. I'm not sure we should rely on an environment variable being set to fix a bug. Or are you suggesting using the OMP_NUM_THREADS env var first instead of OMP_THREAD_LIMIT?

However, in investigating your question, I see an env variable OMP_DYNAMIC, which defaults to true. I will test what happens when that is false. That might be our issue, in the sense that multiple threads may be allowed to adjust their own max OpenMP threads. If so, maybe that will at least solve it in our particular case for now, but I'd still suggest we find a permanent solution that does not rely on env var settings. Will reply with results. |
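For reference, the OMP_DYNAMIC experiment described above can be sketched with a tiny standalone program. This is only a sketch of the test, not code from the PR; it uses standard OpenMP APIs and assumes a compiler with OpenMP enabled (e.g. g++ -fopenmp):

```cpp
#include <omp.h>
#include <cstdio>

int main() {
  // Report the dynamic-adjustment setting and the max thread count as-is
  // (i.e., with whatever OMP_DYNAMIC / OMP_NUM_THREADS are in the environment).
  std::printf("dynamic=%d, max_threads=%d\n",
              omp_get_dynamic(), omp_get_max_threads());

  // Disable dynamic adjustment for this process (the API equivalent of
  // OMP_DYNAMIC=false) and see whether the reported max thread count changes.
  omp_set_dynamic(0);
  std::printf("dynamic=%d, max_threads=%d\n",
              omp_get_dynamic(), omp_get_max_threads());
  return 0;
}
```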
okay, I see. Another solution I suggest is to dynamically increase the buffer, when |
Correct, I didn't want to adversely affect a shared critical pathway with an extra comparison, so I would rather avoid a dynamic buffer size. Also, without a fixed size, you can't generate deterministic thread ids across threads, so you end up with thread collisions anyway. I'm taking a better look at the OpenMP APIs now that I understand the issue better and you've pointed out some good leads. Should have something up today. |
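To make the "fixed size gives deterministic thread ids" point concrete, here is a minimal sketch of the kind of layout being discussed. The names (PushBuffers, PushIndex, x, y) are illustrative only, not LightGBM's actual types:

```cpp
#include <vector>
#include <omp.h>

// Illustrative layout: Y external (calling) threads, X OpenMP threads per
// external thread, and one push buffer per (external thread, OpenMP thread) pair.
struct PushBuffers {
  int x;                                     // OpenMP threads per external thread
  std::vector<std::vector<double>> buffers;  // X * Y buffers, allocated once

  PushBuffers(int x_threads, int y_external_threads)
      : x(x_threads),
        buffers(static_cast<size_t>(x_threads) * y_external_threads) {}

  // Deterministic, non-overlapping index. This only stays collision-free if
  // every caller agrees on the same fixed value of x.
  int PushIndex(int external_thread_id) const {
    return external_thread_id * x + omp_get_thread_num();
  }
};
```

If x is instead re-derived per caller and differs between callers, two callers can compute overlapping or out-of-range indices, which is exactly the collision described above.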
After a more thorough investigation, here's what I found with OpenMP. As a reminder, what we need is a constant that represents how many OpenMP threads each external thread will create. With this constant, we can pre-allocate sets of buffers (i.e., 1 set of X buffers for each of Y calling threads, where Y is given and we need to find X). The ideal candidate would be one of OpenMP's own thread-count settings, so I tried all the APIs that OpenMP has to tweak thread counts: OMP_DYNAMIC, OMP_THREAD_LIMIT, OMP_NUM_THREADS, etc.
Here are my thoughts for a flexible solution that is better than the static constant from the first iteration of this PR:
See the latest iteration for an example of this suggestion. We can still debate details of caching, naming, the location of the env var utils, etc., but based on the more extensive testing I did, this works. It works out of the box, and is still flexible for corner cases. |
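A minimal sketch of the env-var-with-fallback idea (the helper and variable name below are hypothetical illustrations, not the ones added in this PR):

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: read an integer from an environment variable,
// falling back to a default when it is unset or unparsable.
static int IntFromEnvOrDefault(const char* name, int default_value) {
  const char* raw = std::getenv(name);
  if (raw == nullptr) return default_value;
  try {
    return std::stoi(std::string(raw));
  } catch (...) {
    return default_value;
  }
}

// Hypothetical usage: let a corner-case deployment override the assumed
// OpenMP-threads-per-caller count, while everyone else gets a sane default.
// int threads_per_caller = IntFromEnvOrDefault("LGBM_OMP_THREADS_PER_CALLER", 16);
```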
@guolinke Actually, having looked at it and thought about it for another day, it might be better to choose another way to set this value. Since this isn't really an OpenMP setting, perhaps it doesn't make sense to set it in openmp_wrapper. Also, I don't really like the static var. I can think of 2 other options:
I will implement #1 in another iteration when I get a chance. |
I think this is the best fix, although of course you are free to comment. I believe it's ready for checkin, other than your comment on whether we should use 8 or 16 as a default (see TODO in dataset.h). Note the failing R test is most likely a flake. |
It seems that some cpp tests are failing. Will look into this tomorrow. |
Thank you, LGTM!
Let us fix the CI next |
Thanks for the signoff and feedback! Just to be clear, is this a general CI problem, or something related to this PR? (I assume the former, but making sure) |
Please update to the latest master. To be clear, we never merge PRs in this repo where CI is failing, even if the failures seem unrelated to the PR's changes. |
Sure, I guess I meant: is this something I need to merge a fix for, or something where the backend pipeline just needs to be re-run? I merged from master. |
After syncing with main, I'm seeing the error below in a few R tests. I will try a re-run to confirm it's real. The downloaded binary packages are in |
I fixed all the tests, except for the R tests, which are failing because MiKTeX is somehow failing. If that's related to this PR, can someone point me to a possible cause? I don't know anything about MiKTeX. |
That looks unrelated to this PR's changes, as we're seeing it elsewhere on PRs and |
@jameslamb Can this be checked in? |
I'll merge this, based on @guolinke's approval. Normally I'd label a change like this, which adds a new required argument to a public API in ... Thanks for the help! |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
For the streaming push APIs that are designed to work multi-threaded (e.g. LGBM_DatasetPushRowsByCSRWithMetadata), we observed consistent failures in Linux. These APIs rely on pre-allocating batches of sparse push arrays to avoid thread collisions. To do the initial allocation during LGBM_InitStreaming, we were using OMP_NUM_THREADS(). This works just fine on Windows, where tested.

However, in Linux, it appears that each calling thread uses its own space to determine OpenMP thread statistics: neither OMP_NUM_THREADS() nor omp_get_max_threads() returns the same value for different calling threads.
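A repro-style sketch of that observation (not code from the PR): query the OpenMP max thread count from two separate native threads. On the affected Linux setups the printed values were seen to differ between callers, while on Windows they matched. Assumes a compiler with OpenMP enabled:

```cpp
#include <omp.h>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  std::vector<std::thread> callers;
  for (int i = 0; i < 2; ++i) {
    callers.emplace_back([i]() {
      // Each native "calling" thread asks OpenMP how many threads it would get.
      std::printf("caller %d: omp_get_max_threads() = %d\n",
                  i, omp_get_max_threads());
    });
  }
  for (auto& t : callers) t.join();
  return 0;
}
```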
To fix this, this PR switches to a static MAX_THREAD_ALLOCATION constant and simply pre-allocates based on a known, fixed thread space. This wastes a small amount of memory due to over-allocation (empty vectors), but testing confirms that running multi-threaded now works in Linux as well.
Note that there are multi-threaded unit tests to cover this, but because OpenMP is not supported in the C++ unit test framework, the test threading is implemented manually and hence always succeeds. This issue is only seen when OpenMP is active and we are running in Linux.
Example:

2 external Spark Tasks are calling LGBM_DatasetPushRowsByCSRWithMetadata to load a Dataset in parallel. Before pushing, Spark Task thread 0 calls LGBM_InitStreaming with an expected 2 external calling threads (1 for each Spark Task). OMP_NUM_THREADS() returns 3 threads, so the sparse bins allocate 6 independent push buffers (2 external threads * 3 OpenMP threads per external thread) to avoid thread collisions. The expected internal_thread_id range of Spark Task thread 0 is 0-2 and the range of Spark Task thread 1 is 3-5, and these ranges are used to index the correct thread-safe sparse push buffer.

However (in Linux specifically), OMP_NUM_THREADS() returns 4 for Spark Task thread 1, a different value than what Spark Task thread 0 saw during the initial allocation (3). This results in Spark Task thread 1 calling LGBM_DatasetPushRowsByCSRWithMetadata and thinking that its internal_thread_id range is 4-7 (as opposed to 3-5). This causes a JVM crash, since trying to index sparse push buffer 6 or 7 is out of range.

Also observed: even if you add the ability to dynamically add push buffers (to avoid the out-of-range access), you still get indexing collisions. 2 external threads can end up using the same internal_thread_id due to the way we calculate it. OMP_NUM_THREADS() is expected to be a constant so that it generates non-overlapping OpenMP tid ranges, but in Linux it seems to vary depending on which thread called it.
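The arithmetic behind the crash, using the numbers from the example above (the index formula here is a simplified illustration of the internal_thread_id calculation, not the exact code):

```cpp
#include <cstdio>

int main() {
  // At LGBM_InitStreaming time, caller 0 sees 3 OpenMP threads per caller,
  // so 2 callers * 3 = 6 push buffers are allocated (valid indices 0..5).
  const int omp_threads_at_init = 3;
  const int num_buffers = 2 * omp_threads_at_init;

  // Later, on Linux, caller 1 observes 4 OpenMP threads instead of 3 and
  // computes its indices as 1 * 4 + {0..3} = 4..7, overrunning the 6 buffers.
  const int omp_threads_seen_by_caller_1 = 4;
  for (int omp_tid = 0; omp_tid < omp_threads_seen_by_caller_1; ++omp_tid) {
    const int internal_thread_id = 1 * omp_threads_seen_by_caller_1 + omp_tid;
    std::printf("internal_thread_id=%d%s\n", internal_thread_id,
                internal_thread_id >= num_buffers ? " (out of range!)" : "");
  }
  return 0;
}
```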